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Bayesian Detection of Changepoints in 
Finite-State Markov Chains for Multiple 

Sequences 


Fetter Arnesen, Tracy Holsclaw and Padhraic Smyth 1 

We consider the analysis of sets of categorical sequences consisting of piecewise homoge¬ 
neous Markov segments. The sequences are assumed to be governed by a common underlying 
process with segments occurring in the same order for each sequence. Segments are dehned by a 
set of unobserved changepoints where the positions and number of changepoints can vary from 
sequence to sequence. We propose a Bayesian framework for analyzing such data, placing priors 
on the locations of the changepoints and on the transition matrices and using Markov chain 
Monte Carlo (MCMC) techniques to obtain posterior samples given the data. Experimental 
results using simulated data illustrates how the methodology can be used for inference of poste¬ 
rior distributions for parameters and changepoints, as well as the ability to handle considerable 
variability in the locations of the changepoints across different sequences. We also investigate 
the application of the approach to sequential data from two applications involving monsoonal 
rainfall patterns and branching patterns in trees. 

Key words: Changepoint model; Cross-validation; Hidden Markov model; Multiple sequences. 


1. INTRODUCTION 


Finite-state Mar 


wea 


ics flDurbin et al 


cov chains are widely used to model sequential data in applic ations such as 


her models flGabriel and Neumann 


199811 ■ and more flGuttorp 


1962), spe ech recognition flRabineiill989l) . bioinformat 


19951) . A common assumption is that the chain is 


homogeneous, often motivated by a desire to keep the number of model parameters tractable. 
In practice, however, inhomogeneity in various forms is often present. 

In particular in this paper we investigate the problem of modeling sets of categorical-valued 
sequences where each sequence is assumed to be generated by an ordered set of segments, with 
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Figure 1: 31 years of rainfall categorized into no/light rainfall (white), medinm rainfall (gray), 
and heavy rainfall (black). 

nnobserved segment bonndaries that can vary from seqnence to seqnence. Each segment has 
its own Markov dynamics, representing common “phases” for some nnderlying process. As a 


specihc example, consider the modeling of rainfall at a particn. 


a long history of use as stochastic rnodels of rainfall (lNewnha.m 


ar location. Mar 


1916 


Gold 


1929 


iov chains have 


Cochran 


1938 


Wilks and Wilbv 

1999; 

Ghen et al. 

201Q) 


29101 ). Figure [U is an illustration of annual daily rainfall 


sequences for a weather station in Northern India. The appearance of the Indian monsoon 
in early summer causes a visible phase-change in the sequence of daily rainfall for each year, 
motivating the type of segment-based model that is the focus of this paper. Being able to 
infer monsoon onset and withdrawal dates, and to effectively handle int erannual variability, is 


of considerable interest in agriculture modeling and climate science (e.g.. 


Joseph et al. 


111994 1. 


Other examples of data sets characte rized by multip 


ture include tree branchin g patterns lIGuedon et al 


Rafterv and Tavare 


Fearnhead and Liu 


1994 . wind-speed data (IBerchtold 


e seq uences with common seg mental struc- 


20071), and dialog transcripts (iLevin and Pieraccini 


200111 ■ se quences of bird so ngs llCraig 


I999II. DNA seq 


uence s (IBerchtold 


1943 


2002 


20001 ) . 


In this paper we model a particular type of shared structure where a categorical sequence 
is assumed to consist of A' -F 1 segments separated by K unobserved changepoints. Within 
each segment the observed data are assumed to be generated by a homogeneous finite-state 
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Markov chain. We will assume that each of the K + 1 segments has its own Markov parameters, 
resulting in +1 transition matrices. The location of each changepoint is modeled by a discrete 
distribution conditioned on the location of the previous changepoint. We will consider the case 
of a set of L observed sequences (of potentially varying length) where the locations of the K 
changepoints can vary across the sequences. The L sequences share information in that the 
parameters of the Markov transition matrices are global, and the order in which the segments 
appear in each sequence are assumed to be the same, corresponding to common underlying 
phases. Our framework allows any segment to be skipped in a particular sequence, and thus, 
K is in effect the maximal number of changepoints that can occur in any particular sequence. 

There is a large amount of prior literature on changepoint detection for sequential data, 
covering a variety of aspects of this problem, including online/offline estimation, single versus 
multiple sequences, Bayesian versus non-Bayesian inference, and methods for inferring the num¬ 
ber of changepoints. Much of this prior work has focused on the case of independent observations 
and/or nqn-cate gorical data. An exception is the double-Markov chain approach proposed by 


Berchtoldl (1l999l ) using the Expectation-Maximization ( EM) algorithm for inferenc e , and which 


was further developed within a Bayesian framework by 


Fitzpatrick and MarchevI (120131 ) . Be 


cause of the Markov assumption at the segment-transition level, the model imposes an implicit 
geometric distribution on segment lengths which is not always ideal in practice (we will followup 
on this point later in the paper). Th e geometric ass umption can be relaxed by using a hidden 


semi-Markov approach for example flGuedon 


20031) . but this imposes computational restric- 


tions, namely t hat the time complexity of inference (e.g., via the forward-backward algorithm. 


Ra 


Diner 


Polanskv 


fll989)) can scale as O(T^) where T is the length of a sequence. 


(120071) proposed a likelihood-based framework for segmentation assuming a pro¬ 


cess that switches between different Markov chains with unknown parameters, but for the 
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case of a single sequence rather than multiple sequences. A restricted variant of the multi¬ 
ple sequence changepoint problem is the case where sequences are required to be of the same 


can also be viewed as a sing le multivariate sequence fiLai and Xing 


Fitzpatrick and Marchev 


20181 1 ■ 


same location 

in eac 

'a sequence. 

which 

(Lai and Xing 

2011; 

Xing et al. 

2012; 


Zhang and Siegmnndl (1201211 relax this approach to allow sub¬ 


sets of sequences (of equal length and for the case of independent real-valued observations) to 
share changepoints. In the approach proposed here we allow each sequence to have its own 
changepoint locations and sequences to have different lengths. 

While most prior work has focused on likelihood-based inference and point estimates of 
changepoints, here we use full Bayesian inference for both changepoint locations and model 


parameters. Our MCMC inference algorithm has a time complexity per iteration 
i n the l ength T of a seque n ce, ay oiding the O(T^) time complexity incurred by 


hat is linear 


(I2OI2I) 


Rigaill et ah 


Fearnhead and Liul (120111 1 developed a Bayesian changepoint detection that is linear in 


T in terms of time complexity, but for the case of a single real-valued sequence. 

Our focus on multiple sequences also provides a natural context for using cross-validation 


(across sequences) 


the B IG criterion (iGuedon 


2018 1 


2012 


or model sele c tion, providing a practical alternative to 


2008 


Polanskv 


2007 


Fitzpatrick and Marchev 


and related penalizec 


Clevnen and Lebarbiei 


likeli hood approaches (jZhang and Siegmund 


approaches such as 


2013 


2007 


uong et ah 
Rigaill et ah 


201311 . which are not always appr opriate or effective for change- 


point problems (e.g., see the discussion in 


Clevnen et al. 


(1201411 1 


In Section [ 2 ] we introduce our proposed changepoint Markov model, with results on model 
selection in Section |3l In Section 14.11 we present an example of simulated data to illustrate 
different aspects of the proposed approach. Section |4]2] describes the application of our approach 
to a data set of annual branching of apple trees, and in Section 14.81 we discuss the application 
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of the model to a data set consisting of mnltiple years of daily rainfall in India. We conclnde 
the paper in Section [5] with a brief discnssion of the main results as well as suggestions for 
future directions. Discussions on MCMC sampling, how to handle missing data and additional 
simulated examples are provided in the supplementary materials. 

2. A CHANGEPOINT MARKOV MODEL FOR CATEGORICAL 

SEQUENCES 

In this section we dehne our Bayesian model for detection of a hxed number of changepoints 
in multiple sequences. The MCMC sampling algorithm we construct for sampling from the 
resulting posterior distribution is given in Section S.l in the supplementary materials. 

2.1 Homogeneous Markov Chains 

Given a sequence of discrete observations y = (yi, ..., yr), such that yj G {!,..., 77 ,} for all 
j = 1, ...,T, y is a realization from a hnite-state Markov chain if the joint probability of y can 
be written as 

T T 

p{y) = Pi{yi)Ylpj{yj\yj-i, -^yi) = YlPj{yj\yj-i), (i) 

i=2 j=i 

where the conditional distribution of yj given ..., ?/i only depends on the previous state 
yj-i, and we assume some given initial state 7/0 such that Pi{yi) = Pi{yi\yo)- If the transition 
distribution pj(yj\yj_i) = p{yj\yj_i) for all j = 1,...,T, we say that the Markov chain is 
homogeneous. The transition probabilities can be organized into an n x n transition matrix Q, 
where the rows indicate the value 7/j_i and the columns represent the values of yj. We relax 
these assumptions below to allow the Markov chain to be piecewise homogeneous, allowing 
inhomogeneity via local segmentation of the sequence y. 
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2.2 Modeling Local Segments 


To define a piecewise homogeneous Markov chain, we divide the T observations into K + 1 
segments by introducing K changepoints tq = 0 < ti < ... < < T, and write r = (ri,..., tk) 

as the vector of changepoints. If rj_i < Tj then we denote Sj = {rj_i + 1, to be the 

segment and if rj_i = (a segment of length zero) then Si = 0. Introducing the notation 
Vs = {yj\3 G s) for s C {1, ...,T}, the segment will include the data points t/^.. We assume 
the locations of the K changepoints to be generated by a distribution of the form 

K 

p{t\T, 0) =p{Ti,...,TK\T,e) = Y[p{Ti\Ti-i,T,0i), (2) 

i=l 

where p(ri|rj_i, T, is a discrete parametric distribution for changepoint r*, with parameter 
vector and where we let 6 denote the collection of the parameter vectors 6i for i = 1,..., iL. 
In the remainder of the paper we assume the distribution for Tj to be the negative binomial 
distribution truncated to the interval (ri_i,T) (additional details in Section [2.4p . The two- 
parameter negative binomial distribution provides additional flexibility in modeling segment 
lengths compared to a single parameter distribution such as the geometric distribution. This 
can lead to more accurate detection of changepoints, as we will see later in the paper. 

2.3 The Piecewise Homogeneous Markov Chain 

We assume that within each segment the sequential observations in y evolve according to a hxed 

transition matrix, i.e., y is generated by a piecewise homogeneous Markov chain within each 

segment. With K changepoints, we have a total of A" -|- 1 transition matrices ..., 

each having size nxn. In what follows we present the case where the A' -|- 1 transition matrices 

are modeled separately and independently, but it is straightforward to constrain some of these 
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matrices to be the same (as in our rainfall example later in the paper) and to reduce the 
parameter count accordingly. Let Q denote the collection of these K + 1 transition matrices, 
and let denote the element at the row and the column of matrix i. Given an initial 
state I/O; the data likelihood is 

T K +1 

p(y|T, Q,|/o) = Wp{yj\yj-i, Q; 't) = n n (3) 

j=l i=l j€si 

Assume for the moment that no segments are of length zero, i.e., no segments are skipped. If 
yj is the hrst data point in segment Si, i > 1, then we assume it to be distributed according to 
the conditional distribution of yj given yj_i (the last point in segment using the transition 
matrix for segment Si, Qh). The extension to segments of length zero is straightforward. 

We adopt a fully Bayesian approach conditioned on hxed K for our model. For priors for 
the transition matrices we assume each of the rows to be independently distributed according 
to the Dirichlet distribution. In particular, ~ Dir{cx.^^^), where is the row of the 
transition matrix and where Dir{cx.^J^) denotes the Dirichlet distribution with parameter vector 
of appropriate length. We set all elements in to be equal to 1, for all i and k in our 
examples, rendering the prior equivalent to having seen one transition from each category to 
every other category including itself. 

2.4 Modeling Multiple Sequences of Variable Length 

To handle multiple sequences, consider L conditionally independent sequences of observations, 
I = 1,...,L, with lengths Ti,...,Tl, where each sequence consists oi K + 1 segments 
occurring in the same order as described above. Also let |/q ^ denote the initial value for each 
of the I sequences. Assuming the sequences to be conditionally independent, the likelihood for 
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multiple sequences is the product of the likelihoods for each individual sequence (as dehned 
earlier). Segments are allowed to be of zero length (i.e., skipped), effectively allowing the 
number of changepoints per sequence to differ (see the simulation studies in Section S.3.1 and 
S.3.2 in the supplementary material for examples of this property). Let = (r{^\..., r^^) 
denote the changepoints in sequence /. There are a number of options for modeling how the 
locations of the changepoints are related to the lengths of the sequences Ti. One could allow 
the changepoints to have distributions dehned in absolute units (e.g., of time) and treat the 
total length of the sequence as a random quantity. The approach we take here is to assume that 
the distribution on changepoints (or eqnivalently, on segment length) is specihed in terms of 
position relative to the total length of the seqnence, where we treat the observed total sequence 
lengths Ti as hxed quantities and condition on them. In practice the choice of parametrization 
will depend on the specihc nature of the application, and for the special case where the sequences 
are all of the same length, the absolute and relative approaches will be eqnivalent. In particular 
we assume that the position of the changepoints have a negative binomial distribution, with 
parameters 0i = {ri,bi) G (0,1) x (0,1), truncated to the range (rj*'l\,Tj). We write 


-(0 I _(0 Q \ r(L^^^ + l{0h Ti)) (i) _ (i) 


piTy\T>J^,Ti,ei) OC 


r(7(0*,TO) 




(4) 


where 'y{6i,Ti) is dehned via the expression r^Ti = 7 ( 0 j,T;)(l — bi)/bi, which corresponds to 
the expected value of the negative binomial distribution withont trnncation. The parameter r* 
will therefore be related to the expected position of changepoint scaled by the length Ti of 
seqnence /, while 6* will be related to variance of the distribution. This trnncated distribution 
can be efficiently compnted as it does not include a computationally demanding normalizing 
constant. We also assume apriori that the 0j, i = 1, ...,K are independent, and that and 6^ 
are independent and nniformly distributed in the unit interval, 17(0,1). 



3. MODEL SELECTION AND COMPARISON 


Previ ous work on estimating 


BIC (IGuedon 


2003; 


Polanskv 


20071: 


Fitzpatrick and Marchev 


often used formulat 

ions s 

2913: 

Luong et al. 

2013) 


BIC criterion, howev e r, is not directly applicable to changepoint problems (as discussed in 


Zhang and Siegmundl (120071)1. 


instance 


Zhang and Siegmundl fl2007h 


l eading to alternative penalized like 


Clevnen and Lebarbierl (120131) and 


ihood formulations, see for 


Clevnen et ah 


More fully 


Bayesian approache s have also been pursued (for example, by 


( 2011 1 and 


Rigaill et al 


(1201411 ■ 


Fearnhead and Liu 


(1201211 1 but for single-sequences with real or count-valued IID obser¬ 


vations. 

For multiple sequences, 


Xing et al 


(I 2 OI 2 I ) investigate a Bayesian approach for inferring 


boundaries for piecewise homogeneous Markov chains with unobserved changepoints, similar 
to the problem we address in this paper. However, their approach assumes that all sequences 
are the same length and have changepoints in the same position, essentially restricting the 


approach to the case o 


Zhang and Siegmund 


a sin gle sequence with a multivariate distribution. For IID observations 


201211 also consider the scenario of multiple equi-length sequences with 


aligned changepoints across sequences, using a modihed BIC criterion for model selection. 

An alternative approach that could be pursued is that of transdimensional MCMC algo - 


rithms based on a joint posterior distribution for 


Reversible jump MCMC (RJMCMC) algorithms (jCreenlll995h could be used in this context, but 


he model a nd parameter space ([Green 


19951). 


we would anticipate slow convergence and potential sensitivity issues when choosing proposal 
distributions (particularly for the jump proposals). Another option is the Dirichlet process 


framework, although it would not be straightforward to apply th is approac 


tational challenges that would result from the lack of conjugacy (lNealll2000[l . 


1 given the compu- 
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Given these various issues with Bayesian and information-based criteria (at least in the 


con text of our proposed model), we instead use cross-validation with log-probab i 


see 


Vehtari and T;a.mpinenl (120021) : 


Gneiting and RaftervI ( 200711 : 


Gelman et ah 


ity sc o res fe.g.. 


(12014; 


Li et ah 


(1201411 1 for both choosing the number of changepoints and for model selection among alternative 


frameworks. We take advantage of the fact that we are working with multiple sequences and 
use cross-validation at the sequence level, using the log-probability of held-out sequences as our 
scoring function. Our simulation results (next section) suggest that this approach is feasible 
even with a relatively small number of sequences (e.g., 10). When training our model, we hold 
out t of our L sequences, referred to as the test set, and train our model using only L — t of the 
available sequences, referred to as the training set. Denote the L — t sequences in a training set 
by Yf) and the t sequences in the test set by Y_£). To evaluate the quality of the model, we 
estimate the logarithm of the probability of observing the test set Y^d given the training set 
Yd and the model. We average this log-probability for a number of different train/test set pairs 
using cross-validation. The Monte Garlo estimate of the log-probability of one of the sequences 
in the test set is 


lnp{y^^^\YD) = (5) 

i=i j=i 

where ~ |T;, 0^) for j = 1,...,M, and (0W,QW) ~p(0, QIYd) fori = l,...,iV, and 

where we use superscript [•] to denote simulated samples. After convergence of the algorithm 

and for N independent samples of 9 and Q, we simulate M samples from 9). For each 

of these M samples, we calculate the value of the logarithm of the data likelihood in ([3]), and then 

compute the mean. For the results in this paper we used M = 1000 samples, although sensitivity 

analysis (not shown) indicates that M = 100 is sufficient to obtain consistent estimates of the 

log-probability score. For N, which is the number of independent samples from the MGMG 
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algorithm we used N = 1000 following the discussion in Section S.l in the supplementary 
materials on burn-in period and thinning. For all of our results we use 10-fold cross-validation, 
i.e. we partition our data into 10 test and training sets (unless otherwise stated). 

The total log-probability score is dehned to be the sum of the individual scores divided by 
the sum of the length of the sequences 


S{Y.d\Yd) 


E,'., r, 


( 6 ) 


where are the sequences in the test set. Using the equation above we can compare 

different models, either our proposed model with different numbers of changepoints, or our pro¬ 
posed model versus an alternative model. When calculating the cross-validated log-probability 
scores for two models A4.i and Ai 2 we use the same train/test sets and report the difference 
between the scores for each train/test set, i.e., we calculate S'(yL£)| Yd, Wli) — S'(Y_d| Yd, Wl 2 ) 
for each train/test set. 


4. ANALYSIS AND RESULTS 

In this section, we analyze one simulated and two real data sets. For the simulated data 
we generate the sequences and changepoints, allowing us to test our model and compare to 
alternative methods. Additional simulated data sets are provided in the supplementary material 
to further explore different aspects of our model. The two real data sets consists of branching 
patterns for apple trees and daily rainfall over a region Northern India. 

In all of our analyses we compare our method to three baseline models. The first baseline SI 
(Segmental-Independence) is the same as our proposed model but assumes the observations y 
are independent within each segment. The second baseline is a standard hidden Markov model 
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(HMM) with a left-to-right transition matri x. The third baseline is the donble hidden Markov 


chain model (dHMM) from 


Berchtoldl fjl999l ) with a left-to-right transition matrix. These three 


baselines differ from onr proposed model in that the SI and HMM models assnme independence 
of the observations y (rather than Markov dependence), and the HMM and dHMM models 
assnme geometric distribntions on the changepoint locations (rather than a negative-binomial 
distribntion). For each of the three baselines we nse a Bayesian inference procednre similar 
to that described in Section S.l in the snpplementary materials, and we report the resnlts 
for the models with the nnmber of changepoints corresponding to the highest log-probability 
score (which for all the baselines tnrned ont to be the same nnmber of changepoints as in onr 
preferred model). In addition to comparing onr model to these baseline models we also compare 
onr resnlts to those obtained with a model wher e each seqnence is analyzed independently. In 


particular, we use the single-sequence model of 


FearnheadI (120061 ) for comparison, originally 


proposed for real-valued data with independence. See the supplementary materials for these 
results. 


4.1 Synthetic Data Example: Scenario 1 

We simulated L = 10 binary sequences, all of length 200, from our proposed model. For each 
sequence I = 1,..., 10, we simulated one changepoint \ and the position of the changepoint 
was generated from the truncated negative binomial distribution as explained in Section 12.41 
using the parameters ri = 0.5 and hi = 0.8, such that each segment will have length about 100. 
The transition matrices used to generate the binary data in each segment had diagonal entries 
qi^i and ^ 2 ,*, with i denoting the segment. We used qi^i = g 2 ,i = 0.8 in the hrst segment, and 
<?i ,2 = 0.5 and ^ 2,2 = 0.4 in the second segment. 

We used our cross-validation model selection procedure described in Section [3] to determine 
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1 § 


0 12 3 

Number of changepoints, K 


SI HMM dHMM 

Baseline model 


(a) (b) 

Figure 2: Each boxplot shows the log-probability scores, across the validation sets, for a 
different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints and (b) other baseline models, minus the log-probability score for the model with 
one changepoint. 


the number of changepoints. We dehned 10 train/test sets by leaving out one of the 10 sequences 
in each fold, resulting in a “hold-one-sequence-out” cross-validation test. The result of this model 


comparison is shown in Figure 2(a) It is clear from the hgure that the model with a single 


changepoint {K = 1) is the preferred one. As the number of changepoints increases beyond a 
single changepoint the cross-validated log-probability scores become slightly worse. 

The cross-validated log-probability scores of the three baseline models relative to our model 


are shown in Figure 2(b), and as we can see our model performs the best. All models were ht 


using the optimal changepoint number [K = 1). 

We then trained our model with the correct number of changepoints {K = 1), using all 10 
sequences. The estimated parameter values with 95% credible intervals (Cl) (corresponding 
to the 2.5% maximum and minimum percentiles of the posterior samples after convergence) 
are shown in Table S.l and Table S.2 in the supplementary materials. All of the parameters 
are well estimated, although there is considerable uncertainty concerning the estimated value of 
b = bi- For a more detailed discussion of the posterior analysis for b and r see the supplementary 
materials Section S.2.1. 


Figure 3(a) shows, for each sequence, the estimated marginal probability that each obser- 
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Figure 3: The estimated marginal probabilities of (a) the classification of each observation to 
segment 2 (probability 1 is black) and (b) a particular position being a changepoint, where the 
gray scale has been adjusted for each sequence so that the position with the maximum proba¬ 
bility is black and the position with the minimum probability is white. The true changepoint 
locations are marked with (x) in both plots. Note that only position 60 to 140 are shown in 
each sequence. 


vation in that sequence belongs to segment 2. The true changepoint is marked (with an x) for 


each sequence and, as we can see, the changepoints are well recovered. In Figure 3(b) 


we see 


the marginal probability plot for the positions of the changepoints. It is worth commenting on 


the fact that the marginal probabilities in Figure 3(b) do not necessarily vary smoothly as a 
function of location. That is, there are observations that are considered to be unlikely candi¬ 
dates for a changepoint even though the previous and next observations have a high probability 
of being a changepoint. 


4.2 Real Data Analysis: Branching of Apple Trees 


In t his sectio n we a na, 
and Guedon ( 2003 1. 


yze the apple 


Guedon et ah 


t ree br anching data seta presented in iGuedon et ahl (120011) 


(120011 ) describe how the branching structure of the first 


annual shoot of the trunk can be useful as a predictor of the adult tree structure and more 
broadly how branching sequences play an important role in understanding plant development. 


^The data set is available as part of the AMAPmod software (IGodin et all 119971) available at 
http: //ope n alea .gforge. inria.fr/dokuwiki/doku.php For more details the reader is referred to 
Godin et al. ( 19991) 
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Figure 4: 20 apple tree sequences. For this figure sequences 1 to 18 were randomly chosen from 
the data set, while sequences 19 and 20 are the longest and the shortest sequences, respectively. 


As stated by 


Guedon et ah 


(1200 ll) “the sequences may be viewed as a succession of homogeneous 


zones or segments where the composition properties do not change substantially within each 


zone, but change markedly between zones.” 


We analyzed 94 data sequences related to the bra nching o 


apple trees that are left without pruning for one year flGuedonll2003l ). and 20 of these sequences 


the main trunk of two-year old 


are shown in Figure IH The main trunk consists of nodes (metamers or growth units), and 
each of these nodes can be categorized into one of five types of axillary production (or states): 
(1) latent bud, (2) 1-year-delayed short shoot, (3) 1-year-delayed-long shoot, (4) 1-year-delayed 
flowering shoot, and (5) immediate shoot. These different states correspond to nodes without 
any activity (state 1) and four different types of branching (states 2-5) . The data was rec orded 


by examining the main trunk node by node from the top to the base flGodin et ah 


199911 . The 


lengths of the resulting sequences range from 57 to 96 observations as shown in Figure HI 

Looking at the data in Figure |4] there is clearly some inhomogeneity in the observed se¬ 
quences. For example, there is a tendency to have more occurrences of state 5 (immediate 
shoots) before the middle of each sequence, and the sequences seems to be changing between 


different states more frequently at the beginning (top of the main trunk) compared to at the end 
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Figure 5: Each boxplot shows the log-probability scores, across the validation sets, for a 
different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints and (b) other baseline models, minus the log-probability score for the model with 
one changepoint. 


(base of the main trunk). However, it is not clear from visual inspection how many segments 
the data should be divided into, nor where chan gepo ints between seg ments should occur. Using 


exploratory data analysis. 


Guedon et ah 


fennih and 


GuedonI (120031) suggested partitioning the 


sequences into six segments. As an alternative we can, in a data-driven manner, simultane¬ 
ously determine the appropriate number of segments, the likely positions of these segments, 
and estimates of the Markov transition parameters within each segment. 

To hnd the optimal number of changepoints for the model we randomly partition the se¬ 
quences into ten test sets (and ten corresponding training sets). The split is done so that four 


test sets contain ten sequences and six test sets contain nine sequences. In Figure 5(a) 


we 


compare the cross-validated log probability scores of the model with one changepoint (K = 1) 
and models with zero, two, three, four, and five changepoints. We choose the K = 1 model 
as the reference in this case because it was the model with the highest median log-probability 
score, i.e., the model with a single changepoint is the preferred model. 

The cross-validated log-probability scores of the baseline models relative to the preferred 


model, are shown in Figure 5(b) Our model clearly outperforms the two baseline models using 
conditionally independent observations (SI and HMM) supporting the assumption that there 
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Figure 6: The estimated marginal probability distribution of (a) the classification of each node 
to segment 2 (probability 1 is black) and (b) a particular node being a changepoint, where the 
gray scale has been adjusted for each sequence so that the node with the maximum probability 
is black and the node with the minimum probability is white. The end position of each observed 
sequence is marked with a short vertical black line. 


is Markov dependence in the observed sequences. Our model also performs better than the 
dHMM, suggesting that the negative binomial distribution for the changepoint locations is a 
better choice for this data set than the geometric distribution used by the dHMM and HMM. 
For a comparison between our joint model and the approach where each sequence is analyzed 
independently, see the supplementary materials Section S.2.2. 

Using our proposed model with a single changepoint, we ran our sampling algorithm on 
all of the sequences in order to infer the position of the changepoint within each sequence 


and the transition matrix within each segment. Figure 6(a) shows the estimated marginal 
probability of each node being assigned to segment 2, for each of the 20 sequences from Figure 
m The estimated marginal distribution of the position of the changepoint within each of these 


sequences is shown in Figure 6(b) The results suggest that the changepoint tends to occur at 
a location roughly one quarter to one half of the length of the sequence from the start. None 
of the segments in any of the sequences where found to be skipped. 

In Figure [7] we have superposed the observed data over the estimated marginal probability 
of each node being assigned to segment 2, for three selected sequences, namely the hrst sequence 
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Figure 7: Sequence 1 (bottom), 19 (middle) and 20 (top) from FigurelHwith marginal estimates 
of the probability of each position being assigned to segment 2 (as a gray line). 


in Figure 01 the longest sequence (sequence 19) in the same hgure, and the shortest sequence 
(sequence 20). From the hgure we can see that the changepoints are all placed after the phase 
where state 5 (immed iate shoot) is f requen tly v isited. I n vestig ating the position of the six 


segments presented in 


Guedon et ah 


(1200111 and 


GuedonI (1200311 . our hrst and last segments 


correspond to the their hrst three and last three segments, respectively. 
The estimates of the transition matrices for each segment. 
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are consistent with the biology of tree growth. In segment 1, for example, the self-transition 
and transitions into state 4 (’’howering”) are much higher than in segment 2. Since segment 2 
corresponds to the lower part of the trunk, this makes sense since howering is much less likely to 
occur towards the bottom of the trunk (segment 2) relative to the top (segment 1). We can also 


observe that the marginal probability of state 1 is much higher in segment 2 than in segment 1 
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(via its self-transition and in-transition probabilities). State 1 is the non-branching state, so it 
is biologically reasonable to expect less branching in the lower part of the trnnk (segment 2). 
The estimate for the changepoint parameters ri and bi are 0.416 and 0.652, respectively, with 
95% credible interval [0.389,0.442] and [0.456,0.867], respectively. 


4.3 Real Data Analysis: Monsoon Rainfall 


To illnstrate the applicability of the model to a real-world problem we analyze the Indian 
monsoon data described earlier in Section 1. The onset and withdrawal of the a nn ual snmmer 
monsoon is of critical importance in India since it directly i mpacts agricnltnral p rodnction, 


water resonrce management, and hydroelectricity prodnction flLima and Lall 


2009 ). There is 


no precise dehnition of the monsoon season, bnt there is a general nnderstanding that the 


onset is the time of consistent and snbstantial increase in rainfal 


the withdrawal is the time that marks the retnrn to a dry period (IFasnllo and Webster 


over a regional area anc 


2003 


Joseph et ah 


20061) . In terms of nnderstanding the climatological variability of the monsoon over 


time, a first step is to label the changepoints of the onset and withdrawal in the historic record 


fjjoseph et ah 


20061) . enabling (for example) prediction of onset anc 


of exogenons variables snch as large-scale atmospheric qnantities 


withdrawal as a fnnction 


Pai and Nair 


20091) . Onr 


proposed model provides a framework to not only detect bnt also to qnantify onr nncertainty 
abont the estimation of the onset and withdrawal dates and onr resnlts can be viewed as an 


alternative to other a pproaches that nse non-Markov models for this pnrpose (e.g.. 


Lima and Lalll (2009)) 


SternI ( 198211 : 


Finite-state Markov model s have long been nsed to c 


in climatological applications flGabriel and Neumann 


l aracterize d aily rainfall occnrrence 


1962 


Katz 


19741) . Rainfall occnrrence is 


typically defined as any amonnt more than 0.2mm or O.Olin. A second cntoff can be used to 
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distinguish light and heavy rainfall events (anything over 20nini is considered a heavy rainfall 
event). In our analysis below the daily rainfall amount is assigned to one of three discrete 


categories: no/light rainfall with 0-0■2 mm (m = 1 


or heavy rainfall with >20mm [nj = 3) flKatz 


1977h 


, medium rainfall with 0.2-20mm (j/j = 2), 


Our data set consists of daily rainfall measurements over a 31 year period (1979-2010) from 
a weather station located at latitude 24.65 and longitude 77.32 in the monsoon region of Indiac. 
The data is plotted in Figure [T] with each year plotted as a sequence. Rainfall amounts were 
categorized into three states as described above and the 153 missing observations (1.35% of the 
data) were imputed during inference as described in Section S.l in the supplementary materials. 
We denote each day of the year as j = 1,..., 365 (leap days are removed). We assume that the 
hrst and last segments in each year have the same Markov transition matrix, reflecting the fact 
that the end of one sequence on December 31 is contiguous in time with the start of the next 
sequence on January 1. This constraint rules out the possibility of a single changepoint in the 
model, and thus, the number of possible changepoints we can consider is K=0, 2, 3, 4, .... 

To determine the optimal number of changepoints using cross-validation we randomly parti¬ 
tioned our sequences (or years) into 10 training/test sets where nine of the test sets contain three 


sequences and one has four sequences. Figure 8(a) compares the cross-validated log probability 
scores of the model with two changepoints, to models with zero, three, and four changepoints. 
The models with two and three changepoints are very close in performance and outperform 
models with zero or four changepoints. We select the model with two changepoints {K = 2) as 
our preferred model given that it is simpler and it corresponds to our physical intuition about 
the monsoon phenomenon (i.e. onset and withdrawal of the monsoon season). 


^The data were obtained from the U.S. National Centers for Environmental Prediction 
(NCEP) Climate Prediction Center (CPC) Global Summary of the Day (GSOD) Observations, 
http://rda.near.edu/datasets/ds512.0 
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Figure 8: Each boxplot shows the log-probability scores, across the validation sets, for a 
different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints and (b) other baseline models, minus the log-probability score for the model with 
two changepoints. 


Figure 8(b) compares the model with two changepoints with the baseline models SI, HMM, 
and dHMM. As we can see, our proposed model significantly outperforms the three baselines. 
A comparison of the joint model and the approach based on individual sequences is provided 
in Section S.2.3 in the supplementary materials. 


Figure 9(a) shows the estimates of the marginal probability of each day belonging to the 
monsoon season, providing a visual interpretation of the estimated interannual variability in the 
dates of the Indian monsoon onset and withdrawal across 31 years. The marginal distribution 
across all 31 years is shown at the top of the hgure with each of the years plotted per row below. 
Most days in July and August, across all years, are highly probable to be classihed as monsoon 
days (black) and the days at the beginning and end of the monsoon season have less certainty 
of being in the monsoon (gray). 

The probability of each individual day being a changepoint (onset or withdrawal of the 


monsoon) can also be obtained from the model. Figure 9(b) shows the probability of a particular 


day being a changepoint in each of the 31 years. Days that are darker gray are more likely to 
be the monsoon onset or withdrawal date. The top line shows the marginal over all years. We 
see that the highest probability of onset is in June and the highest probability of withdrawal is 
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Figure 9: The estimated marginal probability distribution for (a) the classification of each day 
to the monsoon season (probability 1 is black) and (b) a particular day being a changepoint, 
where the gray scale has been adjusted for each year so that the day with the maximum 
probability is black and the day with the minimum probability is white. 


in September. 

The estimated transition matrices for our model with two changepoints, the first for the dry 
(non-monsoon) season and the second for the monsoon season are as follows 


0.916 

0.026 

0.002 


0.720 

0.216 

0.069 

0.742 

0.232 

0.027 

^monsoon _ 

0.439 

0.436 

0.125 

0.423 

0.367 

0.210 


0.258 

0.489 

0.253 


with rows/columns 1, 2, 3, corresponding to no rainfall, light rainfall, and heavy rainfall cate¬ 
gories respectively. The monsoon and non-monsoon seasons are distinctly different in terms of 
their Markov transition probabilities, with transitions into light and heavy rainfall categories 
being signihcantly higher in the monsoon season compared to the dry season. The estimates for 
the changepoint parameters ri, 6 i, r 2 and 62 are 0.446, 0.810, 0.703 and 0.885, respectively, with 
95 % credible intervals [0.431, 0.461], [0.550, 0.985], [0.687, 0.719] and [0.693, 0.993], respectively. 

Upon examining the model with three changepoints {K = 3), we found it added an ad¬ 
ditional changepoint in the middle of the monsoon season. However the estimated transition 
matrices for these two segments did not differ signihcantly from one anot 

Our results are broadly consistent with other hndings. For example. 


aer. 


Fasullo and Webste: 
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(120031) analyzed data from a station further south in India, estimating that the monsoon onset 


dates at this location range from May 19 to June 20 with the median being June 6, and the 
withdrawal dates range from August 12 to September 27 with the median being September 
4. These dates are consistent with our estimates at a more northerly location, given that the 
monsoon approaches from the south in India, and thus, we would expect our calculated onset 


and withdrawal dates to be somewhat later than those of 


Fasullo and Websterl (120031) . We 


obtain a mean estimate of June 13 for the monsoon onset with 95% probability interval (PI) 
of [May 23, July 4] and the mean date for the monsoon withdrawal is September 13 with an 
estimated 95% PI of [August 24, October 3]. These estimates are based on the data shown at 


the top of Figure 9(b) where the 95% PI is computed by summing up the estimated marginal 


probabilities symmetrically around the mean until a 95% interval is reached. 


5. DISCUSSION AND CONCLUSIONS 

We introduced a piecewise homogeneous Markov chain model where changepoint positions are 
modeled by a discrete distribution, in particular, a truncated version of the negative binomial 
distribution. The model is constructed to handle multiple sequences of variable length where 
each sequence moves between the underlying Markov chains in the same order. We show 
that the changepoints are well recovered on synthetic data, resulting in accurate estimates for 
the parameters used to dehne our model. To illustrate the utility of the model on real-world 
data sets we applied the model to an apple tree branching data set and to daily rainfall data 
collected for 31 years in Northern India. In the apple tree data the model suggests that a single 
changepoint for each sequence is best, and for the rainfall data the model is able to detect 
the onset and withdrawal of the monsoon season. In both cases the proposed model produced 
inferences of parameters and changepoint locations that were scientihcally interpretable. In 
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Data set 

K = 0 

K= 1 

K = 2 

K = 3 

Simulated Data 1 

0.002 

0.015 

0.065 

0.138 

Monsoon Data 

0.054 

0.163 

0.432 

0.795 


Table 1: Computational time per iteration in seconds for two of the data sets discussed in the 
paper, with code in Matlab on a computer with 2.80 GHz CPU. 


addition the model systematically outperformed alternative approaches such as hidden Markov 
and double-hidden Markov models. 

Our Bayesian framework allows different sequences to draw strength from each other both 
when hnding the changepoints and for parameter estimation. Estimates of uncertainty about 
both parameters and latent variables, which arise naturally from our MCMC inference algo¬ 
rithm, provide an appealing interpretation of the uncertainty regarding position of the change- 
points. This is of particular interest for example in analyzing the onset and withdrawal of 
the Indian monsoon season. Uncertainty quantihcation is an increasingly important com¬ 
ponent of climate data analysis, and the type of Bayesian approach used in this paper can 

.ternative to more traditional methods such as using thresh- 


old values ( 

Lima anc 

Ball 

2009) 

(Fasullo and Webster 

2003 

)• 


20091) or relying entirely on dehnitions based on prior knowledge 


Comparisons of the computation times for two of the data examples in the paper are shown 
in Table [H comparing the times for the smallest data set (the hrst simulated data set) and the 
largest data set (the monsoon data). The numbers shown are for one MCMC iteration when 
htting the model to the full data set (we used parallelization for our cross-validation runs). We 
see that as the number of changepoints increase the method is more computationally expensive, 
but overall the method is relatively fast even on the larger data set. Additional speedups could 
be obtained (for example) by parallelizing the analysis across sequences, coding the algorithm 
in a more efficient language than Matlab, and so on. 

There are a number of additional extensions to the model that may be potentially useful 
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to explore. For example, a useful direction would be the development of conditional models 
p{y\x), where each sequence y = {yi,, yr) is accompanied by a sequence x = (a^i,..., xt) 
of exogenous variables and where each Xt could be multivariate, t = 1... ,T. The exogenous 
variables x could influence the y^s directly and/or the locations of the changepoints. Another 
direction would be models for handling high-dimensional and/or real-valued observation se¬ 
quences, where one could assume the existence of a categorical latent sequence z = {zi,..., zt) 
as a low-dimensional representation of the observed sequence data y, extending the general 
approach presented here to piecewise homogeneous hidden Markov chains. 
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Supplementary materials to the paper: 
Bayesian Detection of Changepoints in 
Finite-State Markov Chains for Multiple 

Sequences 


Fetter Arnesen, Tracy Holsclaw and Padhraic Smyth 1 

S.l. SAMPLING ALGORITHM FOR POSTERIOR INFERENGE 


For notational convenience we assume in this section that the number of observed sequences 
is L = 1. The generalization to sampling with multiple sequences is straightforward given the 
assumption of conditionally independent sequences from Section 2.4. We use the Metropolis- 
Hasting algorithm in all of our sampling updates. 

The changepoints are sampled by proposing a new position fj = r* -|- 1 or fi = r* — 1, both 
with probability 0.5, with obvious adjustments if there are zero-length segments involved in the 
update. Letting r = (ri,..., r*,..., tk) be the current segmentation and r = (ri,..., fj, be 

the proposed new segmentation, this yields the following acceptance probability 



where q{f\T) is the probability of proposing r given that the current state is r. 

To update r* we adopt the following strategy. First we propose a new value fi from a proposal 
distribution of the form (fj, 1 — ri)\ri ~ Dir{ar{ri, 1 — rj)), where a,, is a tuning parameter that 
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controls how close the proposed value r* will be to the current value rj. The same strategy is 
used when proposing new values for hi, using a separate tuning parameter ah- To update the 
values in the transition matrix, we propose to change one of the n rows. For example, 
is updated by a proposal distribution of the form ~ Dir{aQQ^^), where aq is a 

tuning parameter controlling the acceptance rates for such updates. For the results reported 
in this paper we set tuning parameter values to ar = 1000, at = 100, and ag = 1000. These 
values were chosen to obtain fast convergence and good mixing of the MCMC chain, and we 
found that the algorithm is not particularly sensitive to these values. In particular, a hve-factor 
change in these values does not signihcantly effect the sampling. 

We dehne one iteration of our algorithm to be a single proposed update of hi and r* for 
one value of f = 1,..., iF, a single proposed update for one of the rows in one of the transition 
matrices, and V proposed updates of each of the changepoints in each sequence. We found 
empirically that using V > 1 updates leads to faster convergence and better mixing of the 
MCMC sampler (we used V = 5 for the results in this paper). Results are obtained from 
100,000 iterations of the MCMC chain after removing a burn-in period of 10,000 iterations. In 
addition, we do thinning by using only every 100th iteration in order to obtain independent 
samples. 

In practice it is common to have missing data (for example for the rainfall data discussed in 
the paper) motivating the use of a systematic approach for handling such missing observations. 
We marginalize over missing observations in our sampling algorithm whenever we need to 
evaluate the data li kelihood in (3). The c alculation is carried out using a version of the recursive 


algorithm given in 


Friel and Rud (120071) . Let m = {j\yj is missing}, and let y* denote the 


actual value of a non-missing variable yj. We write p{yj = y*\yj-i,T, Q) when evaluating the 
probability of that event. When marginalizing over missing values the data likelihood of the 
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non-missing values y-m becomes 


p(y-m|T,Q,2/o) = ^ p(y|T,Q,|/o). (8) 

yj-.jem 

To evaluate this let ^ 0 ( 2 / 1 ) = PiVilVo), cind for j = 1, ...,T — 1 we define recursively 




piVj+ilVj = y*)zj_i{yj = y*) 

< 

T.y^piyj+i\yj)zj-iiyj) 


yj is not missing 
yj is missing. 


To complete the calculation we hnd 


p{y-m = y*_JQ,T,yo) 


ZT-iiyr = 2 / t ) 

< 

Yjyr ^T-liyr) 


yx is not missing 
yx is missing. 


S.2. ADDITIONAL POSTERIOR RESULTS FOR THE 


EXAMPLES IN THE PAPER 


S.2.1 Synthetic data: Scenario 1 


Parameters b and r are correlated with one another thus making their values interdependent. 
Because of this interdependence we investigate below the marginals and the sensitivity of the 
model relative to these parameters. Both parameters are well estimated, although there is 
considerable uncertainty concerning the estimated value of b (see Table IS.ip . Figure IS.Uf a) 
shows, however, that the distribution of the changepoints, p{t\T = 200, r = 0.5,6), is not 
particularly sensitive to the value of b within the credibility interval. The parameter r primarily 


controls the position of the changepoints. Sensitivity analysis for this parameter is shown in 
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Trne Est. 

95 % Cl 

ri 

0.5 0.458 

0.406,0.513 


hi 

0.8 0.615 

0.444,0.927 



Table S.l: Parameter estimate and 95% CL 




(a) (b) 

Figure S.l: (a) Sensitivity of the probability distribntion p{t\T = 200, r = 0.5,6). The solid 
line corresponds to 6 = 0.8 (trne valne), dashed line to 6 = 0.615 (estimated valne), and the 
dotted lines correspond to 6 = 0.444 and 6 = 0.927 (limits of the 95% Cl). The dash-dotted line 
corresponds to 6 = 0.1. (b) Sensitivity of the probability distribntion p{t\T = 200, r, 6 = 0.8). 
The solid line corresponds to r = 0.5 (trne valne), dashed line to r = 0.458 (estimated valne), 
and the dotted lines correspond to r = 0.406 and r = 0.513 (limits of the 95% Cl). The 
dash-dotted line corresponds to r = 0.8. 



Number of changepoints, K 


Number of changepoints, K 


Number of changepoints, K 


(a) (b) (c) 

Figure S.2: Each boxplot shows the log-probability scores, across the validation sets, for 
a different model. The y-axis is defined as the log-probability score for other nnmbers of 
changepoints minns the log-probability score for the model with one changepoint, and the three 
plots correponds to different nnmber of seqnences L\ (a) L=5, (b) L=20 and (c) L=100. 


Fignre ISdT bL The relatively high posterior nncertainty for these two parameters is somewhat 
dne to the small sample size of only 10 seqnences. We also generated a data set with L = 100 
seqnences and fonnd that the estimate for 6 is 0.712 with a 95% Cl of (0.555,0.902), and as 
expected the extra seqnences improve the accnracy and precision of the estimate. 

We also varied L in this scenario to test the sensitivity of the model to the nnmber of 
seqnences. We generated data sets with L = 5, L = 20, and L = 100 and display the resnlts in 
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Figure IS.21 For the data sets with L = 20 and L = 100 the results are similar to Figure 2(a) 
with L = 10. For L = 5 the cross-validation test tends to prefer the correct solution [K = 1) 
(where we generated 15 folds in total, comprised of validation sets with either a single sequence 
or a pair of sequences), although K = 0 (no changepoint) is preferred in some cases, likely due 
to limited availability of data to support a more complex model. 

In addition to comparing our model to the baseline models presented in the paper we also 
compare our results to those obtained with a model where each sequence is analyzed indepen¬ 
dently. In particular, we use the single-sequence model of Fearnhead (2006) for comparison, 
originally proposed for real-valued data with independence. We adapt this model by condition¬ 
ing on the number of changepoints and extending to a Markov assumption. In the Fearnhead 
model, the prior assumes a geometric distribution to the length of each segment. By hxing 
the number of changepoints this simply becomes a uniform distribution on the position of the 
changepoints. Note that this is a special case of our model applied to single sequences when 
bi approaches zero for all i. Figure ISTSl shows the results from running this model with a hxed 
(correct) number of changepoints, and the results can be compared to those of Figure 3 in the 
paper. Comparing these hgures illustrates the strength of our joint modeling approach. Since 
our model is able to draw strength across sequences we achieve more accurate estimates of the 
position of the changepoints compared to the single sequence model. Similar results can be 
shown for the estimate of the transition matrices in each of the segments (see Table IS^ . In 
this table we clearly see that the estimates for the transition probabilities for the individual 
sequences have a much higher uncertainty than the estimates for our joint model approach 
(shown in the last row). 
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(a) (b) 

Figure S.3: The estimated marginal probabilities for individual sequence model of (a) the 
classihcation of each observation to segment 2 (probability 1 is black) and (b) a particular 
position being a changepoint, where the gray scale has been adjusted for each sequence so 
that the position with the maximum probability is black and the position with the minimum 
probability is white. The true changepoint locations are marked with (x) in both plots. Note 
that only position 60 to 140 are shown in each sequence. 


Seq. 

gi.i(95 %CI) 

?2,1 

(95 %CI) 

?1.2 

(95 %CI) 

^2.2 

(95 %CI) 

1 

0.856 

,0.757,0.907, 

0.711 

,0.580,0.824, 

0.462 

(0.351,0.582, 

0.190 

,0.113,0.300) 

2 

0.770 

0.654,0.840 

0.725 

0.608,0.829 

0.392 

0.295,0.497 

0.389 

0.284,0.506 

3 

0.842 

0.751,0.910 

0.760 

,0.628,0.876 

0.504 

0.393,0.614 

0.430 

,0.323,0.530) 

4 

0.820 

0.722,0.895 

0.844 

(0.738,0.928 

0.601 

0.446,0.717 

0.447 

(0.316,0.565 

5 

0.764 

0.682,0.849 

0.739 

0.645,0.816 

0.506 

0.389,0.646 

0.484 

0.392,0.620 

6 

0.807 

0.689,0.890 

0.701 

0.576,0.798 

0.565 

(0.464,0.658 

0.393 

,0.295,0.504) 

7 

0.866 

0.777,0.928 

0.713 

0.557,0.840 

0.475 

(0.366,0.606^ 

0.371 

(0.276,0.483 

8 

0.802 

0.690,0.884 

0.842 

0.741,0.917 

0.651 

'0.5440,0.746) 0.326 

0.216,0.472 

9 

0.846 

0.746,0.898 

0.621 

0.395,0.763 

0.513 

0.390,0.629) 

0.480 

0.340,0.606 

10 

0.811 

T.668,0.929^ 

0.885 

T.789,0.947^ 

0.512 

(0.395,0.627^ 

0.379 

T.271,0.475) 

Joint 

0.829 

,0.792,0.852, 

0.772(0.742,0.810) 

0.514 

(0.475,0.550, 

0.382(0.344,0.423) 


Table S.2: Parameter estimate and 95% Cl for the transition probabilities estimated based 
on individual sequences (Seq. 1-10), and for our joint model approach (Joint). 


S.2.2 Real Data Analysis: Branching of apple Trees 

In this section we present results from running the individual sequence approach on the apple 
tree data set. In Figure [S31 we show the estimates for the position of the changepoints using 
a model with K = 1, and this is to be compared with the result from our joint model shown 
in Figure 6 in the paper. From these hgures we see that the individual sequence approach ap¬ 
proximately hnds the same changepoints as the joint model approach, although with somewhat 
higher uncertainty. Note that for some of the sequences, for instance sequences 5 and 12 in 
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(a) (b) 

Figure S.4: The estimated marginal probability distribution of (a) the classification of each 
node to segment 2 (probability 1 is black) and (b) a particular node being a changepoint, 
where the gray scale has been adjusted for each sequence so that the node with the maximum 
probability is black and the node with the minimum probability is white. The end position of 
each observed sequence is marked with a short vertical black line. 


Figure [Sl4l the individual sequence approach converges into a state with a zero length segment, 
which is somewhat surprising as there seems to be no significant difference in the structure of 
these two sequences compared to the others (see Figure 4 in the paper). This result is not 
present in our joint approach due to the sharing of information between the sequences. Also, 
because this is categorical data with hve states the uncertainty in the estimates of the transi¬ 
tion probabilities for the individual sequence approach becomes quite high due to lack of data, 
compared to our joint approach. 


S.2.3 Real Data Analysis: Monsoon Rainfall 

We also applied the individual sequence approach to the rainfall data presented in Section 4.3. 
in the paper. We assumed for this experiment that K = 2 and the results are shown in Figure 
IS.51 This hgure clearly shows very high uncertainty regarding the position of the changepoints, 
and in most of these cases these changepoints are completely unrelated to the actual onset and 
offset of the monsoon season. This hgure should be compared to the results using our joint 
approach shown in Figure 9 in the main paper. 
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(a) (b) 

Figure S.5: The estimated marginal probability distribution for (a) the classihcation of each 
day to the monsoon season (probability 1 is black) and (b) a particular day being a changepoint, 
where the gray scale has been adjusted for each year so that the day with the maximum 
probability is black and the day with the minimum probability is white. 


S.3. ADDITIONAL SYNTHETIC AND REAL DATA ANALYSIS 


EXAMPLES 


S.3.1 Synthetic Data Example: Scenario 2 


In this scenario, we generate binary data from our model for L = 30 sequences with length 
Ti = 200. The first 20 sequences are generated with two changepoints and the last 10 sequences 
are generated with only one changepoint. The 10 sequences with one changepoint are generated 
using the same parameters as described in Section 4.1 in the paper. The 20 sequences with two 
changepoints have similar hrst and last segments (as in Section 4.1) plus an additional middle 
segment where (^ 2 , 62 ) = (0.75,0.8) and qi ^2 = ^ 2,2 = 0.2. 

We used 10-fold cross validation to find the optimal number of changepoints. In this scenario 
K = 2 effectively represents the true maximal number of changepoints in any sequence, where 
sequences with fewer changepoints can skip the “extra” segments. Thus, the interpretation of a 
single “overall best K" has less meaning in this context. Nonetheless, using K = 2 changepoints 


as the reference. Figure S, 6 (a) shows that two changepoints {K = 2) are preferred using the 
cross-validation score, which is consistent with the fact that 20 out of the 30 sequences were 


generated with two changepoints. Figure S, 6 (b) shows that our model outperforms the other 


38 























0 12 3 

Number of changepoints, K 


SI HMM dHMM 

Baseline model 


(a) (b) 

Figure S.6: Each boxplot shows the log-probability scores, across the validation sets, for a 
different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints and (b) other baseline models, minus the log-probability score for the model with 
two changepoints. 
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Segment 4 
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(c) (d) 

Figure S.7: Distribution of the length of each segments in each sequence in the K = 3 solution. 
a)-d) are segments 1 to 4 respectively. Each sequence has a box plot for the posterior estimate 
of the length of its segment. Eor simplicity we show the expected length of each segment as 
a solid gray line, noting that for each simulated sequence the actual segment length will vary 
around this expected length. 


baseline models, all fit with K = 2. 

To illustrate how our model skips segments, we examine in detail the solutions found by the 
model when it is £t to all 30 sequences and allowed to use K = 3 changepoints, even though the 


39 

























Est. 

ri 

0.382 

bi 

0.618 

r2 

0.412 

^2 

0.858 

rs 

0.732 

h 

0.004 


0.779 

? 2,1 

0.803 

? 1,2 

0.722 

Q2,2 

0.708 

? 1,3 

0.216 

? 2,3 

0.184 

? 1,4 

0.496 

5 ' 2,4 

0.369 


95 % Cl 

0.225,0.512 

'0.242,0.832 

0.245,0.538 

'0.510,0.901 

0.556,0.904 

0 . 001 , 0.012 

'0.768,0.804 

'0.786,0.821 

'0.365,0.846 

0.322,0.834 

0.194,0.239 

0.156,0.209 

'0.443,0.546 

'0.312,0.422' 


Table S.3: Parameter estimate and 95% Cl. 


data was generated with only K = 1 oi K = 2 changepoints. Figure ISTl shows the resnlts of 
this experiment, with each of the fonr panes corresponding to one of the fonr possible segments 
that the model conld infer. The estimated posterior distribntion of the length of each segment 
is shown in the form of a box plot (the |/-axis) for each of the 30 seqnences (the x-axis). Recall 
that seqnences 1 to 20 were generated with K = 2 changepoints and seqnences 21 to 30 with 
K = 1 changepoint. The expected length of each segment given the trne changepoint model 
is shown as a gray line. From the hgnre we see that the second segment (Fignre IS.7( b)) is 
skipped for all seqnences, i.e., the model finds no evidence for an additional segment. Similarly, 
in Fignre IS.7f c) we see that seqnences 21 to 30 infer very short or zero length segments for 
the third segment, and as a conseqnence the fonrth segment is mnch longer in these seqnences 
(Fignre [S3(d)). Thns, even with misspecification of the valne of K, and when seqnences have 
variable nnmbers of changepoints, onr model appears to perform well. 

For the posterior distribntions of the parameters, we see some snmmary statistics in Table 
IS.31 For these statistics we find many similarities with the trne parameters of the data. For 
instance, the 95% CIs of ri and bi inclnde the trne parameter of the first changepoint for both 
types of seqnences. Next, r 2 and &2 correspond to the changepoint leading np to the first (often) 
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zero length segment, so it is reasonable that r 2 is close to ri. Next, we see a high uncertainty 
in the position of the third changepoint, 63 = 0.004, which makes sense because this is the 
changepoint leading up to the third changepoint, which is approximately of length 50 for the 20 
hrst sequences and (often) zero for the last 10 sequences. For the transition matrix parameters 
we see high uncertainty for the estimates of qi ^2 and ^ 2,2 which makes sense because this segment 
is (often) zero for all the sequences. All other transition matrix parameters are well recovered. 

We also investigated both the K = 2 and the K = 1 solution for this data set. For the 
K = 2 case, one segment is skipped for the last 10 sequences, as expected. For the K = 1 
solution the position of the single changepoint in the last 10 sequences gets well estimated, 
while in the hrst 20 sequences the position of the hrst changepoint is the one that is recovered 
accurately and the model merges the second and third segments. 

S.3.2 Synthetic Data Example; Scenario 3 

In this section we present an experiment similar to what is done in the previous section (Section 
S.3.1), however we switch the number of sequences with K = 1 changepoints with the number 
of sequences with K = 2 changepoints. In Figure ESI we see the result from our cross validation 
procedure for diherent numbers of changepoints, and with comparisons to the baseline models. 
We observe iF = 2 to be the superior choice. This is slightly surprising considering that the 
example in Section S.3.1 gave a less signihcant difference between the K = 1 and K = 2 solution 
even though in that case there were more sequences with K = 2 changepoints. We believe that 
this is likely to result from differences between these two random data sets. 

In Figure ED we see the posterior estimation of the length of each segments when we assume 
a model with K = 3 changepoints, again a similar procedure as in Section S.3.1. Again the 
zero length segments can be observed. 
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Figure S.8: Each boxplot shows the log-probability scores, across the validation sets, for a 
different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints and (b) other baseline models, minus the log-probability score for the model with 
two changepoints. 
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Figure S.9: Distribution of the length of each segments in each sequence in the K = 3 solution. 
a)-d) are segments 1 to 4 respectively. Each sequence has a box plot for the posterior estimate 
of the length of its segment. For simplicity we show the expected length of each segment as 
a solid gray line, noting that for each simulated sequence the actual segment length will vary 
around this expected length. 


S.3.3 Synthetic Data Example: Scenario 4 and 5 


In this section we briefly present an experiment that illustrates the performance of our procedure 

when the number of possible categories for yj increases, i.e. the dimension of increases. In 
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Figure S.IO: Each boxplot shows the log-probability scores, across the validation sets, for 
a different model. The y-axis is dehned as the log-probability score for (a) other numbers of 
changepoints for scenario 4 and (b) other numbers of changepoints for scenario 5. 


particular we keep the changepoints simulated for the 10 sequences in Section 4.1 in the paper, 
but simulate instead y, for scenario 4, by using the following transition matrices 


■ 0.8 

0.1 

0.1 ■ 


■ 0.333 

0.333 

0.333 ■ 

0.1 

0.8 

0.1 


0.350 

0.300 

0.350 

0.1 

0.1 

0.8 


0.375 

0.375 

0.250 


and for scenario 5 the matrices 



■ 0.800 

0.033 

0.033 

0.033 ■ 


■ 0.250 

0.250 

0.250 

0.250 ■ 

= 

0.033 

0.033 

0.800 

0.033 

0.033 

0.800 

0.033 

0.033 


0.267 

0.283 

0.200 

0.283 

0.267 

0.150 

0.267 

0.283 


. 0.033 

0.033 

0.033 

0.800 . 


. 0.300 

0.300 

0.300 

0.100 . 


which are chosen to somewhat imitate the transition matrices in Section 4.1 but now with 3 
and 4 categories, respectively. Figure [STO] shows the result when running our cross-validation 
procedure on these two data sets. In both of these two multi-category examples we are able to 
detect the correct number of changepoints. We next investigate the distribution of the position 
of the changepoints under these two scenarios, see Figure IS.Ill Comparing these results to 
Figure 3 in the paper, we see that the uncertainty regarding the position of the changepoints 
increases due to less data, although the results are still relatively accurate. Similar results can 
be seen in the estimates for the model parameters Q^‘^\ ’"i and hi (not shown). 


43 








































(c) (d) 

Figure S.ll: The estimated marginal probability distribution of (a) the classihcation of each 
node to segment 2 (probability 1 is black) for scenario 4 and (b) a particular node being a 
changepoint for scenario 4, where the gray scale has been adjusted for each sequence so that 
the node with the maximum probability is black and the node with the minimum probability 
is white. In (c) and (d) similar results are shown for scenario 5 respectively. 
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